Ordination is a useful tool for visualising multivariate data. These notes focus on explaining and demonstrating principal component analysis in the R programming language. Examples will use morphological data from six species of fig wasps in a community associated with the Sonoran Desert Rock Fig (Ficus petiolaris). These notes are also available as PDF and DOCX documents.
Ordination is a method for investigating and visualising multivariate data. It allows us to look at and understand variation in a data set when there are too many dimensions to see everything at once. Imagine a data set with four different variables. I have simulated one below.
| | Variable_1 | Variable_2 | Variable_3 | Variable_4 |
|---|---|---|---|---|
| Sample_1 | 4.5466071 | 5.494013 | 1.1928819 | 7.5155256 |
| Sample_2 | 0.9502493 | 2.500261 | 0.3710835 | -2.8864147 |
| Sample_3 | 7.0992270 | 5.137512 | -3.8809346 | -3.5856435 |
| Sample_4 | 5.3520635 | 5.969104 | 6.4208561 | 0.4038182 |
| Sample_5 | 0.4745339 | 2.022876 | -0.4423745 | 1.3753561 |
| Sample_6 | 3.7412326 | 5.699307 | 7.0081519 | 5.9734045 |
| Sample_7 | -3.3022738 | -4.648136 | 1.0994441 | 5.0241507 |
| Sample_8 | 6.0273319 | 4.952885 | 3.3619108 | 3.3858250 |
| Sample_9 | 3.0707621 | 1.082454 | 3.6368359 | 3.9896739 |
| Sample_10 | -0.0737763 | -1.186276 | 5.0562494 | 2.6314378 |
| Sample_11 | -3.2838354 | -2.287183 | -3.5745355 | 4.2392810 |
| Sample_12 | 4.2548760 | 5.167029 | 7.1117294 | 7.7964904 |
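Data like these are easy to simulate in base R. The snippet below is a hypothetical recipe (not the exact code used to generate the table above): two variables built from a shared latent value so that they are correlated, plus two independent ones.

```r
# Hypothetical simulation of a small four-variable data set; the
# shared 'latent' value makes Variable_1 and Variable_2 correlated.
set.seed(1)
latent     <- rnorm(n = 12, mean = 2, sd = 3)
Variable_1 <- latent + rnorm(n = 12, sd = 1)
Variable_2 <- latent + rnorm(n = 12, sd = 1)
Variable_3 <- rnorm(n = 12, mean = 2, sd = 3)
Variable_4 <- rnorm(n = 12, mean = 2, sd = 3)
sim_dat    <- data.frame(Variable_1, Variable_2, Variable_3, Variable_4,
                         row.names = paste0("Sample_", 1:12))
print(round(sim_dat, 4))
```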
We could look at the distribution of each variable in columns 1-4 individually using a histogram. Or we could use a scatterplot to look at two variables at the same time, with one variable plotted in one dimension (horizontal x-axis) and a second variable plotted orthogonally (i.e., at a right angle) in another dimension (y-axis). I have done this below with a histogram for Variable_1, and with a scatterplot of Variable_1 versus Variable_2. The numbers in the scatterplot points correspond to the sample number (i.e., rows 1-12).
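Both plot types can be reproduced with base graphics. The sketch below uses hypothetical stand-in data (the same simulation idea as above), not the exact values in the table.

```r
# Stand-in data: two correlated variables for 12 samples.
set.seed(1)
latent <- rnorm(n = 12, mean = 2, sd = 3)
toy    <- data.frame(Variable_1 = latent + rnorm(12),
                     Variable_2 = latent + rnorm(12))
par(mfrow = c(1, 2), mar = c(5, 5, 1, 1))
hist(toy$Variable_1, main = "", xlab = "Variable_1")  # one variable
plot(x = toy$Variable_1, y = toy$Variable_2, type = "n",
     xlab = "Variable_1", ylab = "Variable_2")        # two variables
text(x = toy$Variable_1, y = toy$Variable_2,
     labels = 1:12)                                   # sample numbers
```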
Since we do not have four dimensions of space for plotting, we cannot put all four variables on a scatterplot that we can visualise with an axis for each variable. Hence, in the scatterplot to the right above, we are ignoring Variable_3 and Variable_4.
For variables 1 and 2, we can see that Sample_1 is similar to Sample_12 because the points are close together, and that Sample_1 is different from Sample_7 because the points are relatively far apart. But it might be useful to see the distance between Sample_1, Sample_12, and Sample_7 while simultaneously taking into account Variables 3 and 4. More generally, it would be useful to see how distant different points are from one another given all of the dimensions in the data set. Ordination methods are tools that try to do this, allowing us to visualise high-dimensional variation in data in a reduced number of dimensions (usually two).
It is important to emphasise that ordination is a tool for exploring data and reducing dimensionality; it is not a method of hypothesis testing (i.e., not a type of linear model). Variables in ordinations are interpreted as response variables (i.e., \(y\) variables in models where \(y_{i} = \beta_{0} + \beta_{1}x_{i} + \epsilon_{i}\)). Ordinations visualise the total variation in these variables, but do not test relationships between independent (\(x\)) and dependent (\(y\)) variables.
These notes will focus entirely on Principal Component Analysis (PCA), which is just one of many ordination methods (perhaps the most commonly used). After reading these notes, you should have a clear conceptual understanding of how a PCA is constructed, and how to read one when you see one in the literature. You should also be able to build your own PCA in R using the code below. For most biologists and environmental scientists, this should be enough knowledge to allow you to use PCA effectively in your own research. I have therefore kept the maths to a minimum when explaining the key ideas, putting all of the matrix algebra underlying PCA into a single section toward the end.
Imagine a scatterplot of data, with each variable getting its own axis representing some kind of measurement. If there are only two variables, as with the scatterplot above, then we would only need an x-axis and a y-axis to show the exact position of each data point in data space. If we add a third variable, then we would need a z-axis, which would be orthogonal to the x-axis and y-axis (i.e., three dimensions of data space). If we need to add a fourth variable, then we would need yet another axis along which our data points could vary, making the whole data space extremely difficult to visualise. As we add yet more variables and more axes, the position that our data points occupy in data space becomes impossible to visualise.
Principal Component Analysis (PCA) can be interpreted as a rotation of data in this multi-dimensional space. The distance between data points does not change at all; the data are just moved around so that the total variation in the data set is easier to see. If this verbal explanation is confusing, that is okay; a visual example should make the idea easier to understand. To make everything easy to see, I will start with only two dimensions using Variable_1 and Variable_2 from earlier (for now, just ignore the existence of Variables 3 and 4). Notice from the scatterplot that there is variation in both dimensions of the data; the variance of Variable_1 is 11.91, and the variance of Variable_2 is 12.83. But these variables are also clearly correlated. A sample for which Variable_1 is measured to be high is also very likely to have a high value of Variable_2 (perhaps these are measurements of animal length and width).
What if we wanted to show as much of the total variation as possible just on the x-axis? In other words, what if we rotated the data so that the maximum amount of variance in the full data set (i.e., in the scatterplot) falls along the x-axis, with any variation remaining being left to fall along the y-axis? To do this, we need to draw a line that cuts through our two dimensions of space in the direction where the data are most stretched out (i.e., have the highest variance); this direction is our first Principal Component, PC1. I have drawn it below in red (left panel).
To build our PCA, all that we need to do is take this red line and drag it to the x-axis so that it overlaps with \(y = 0\) (right panel). As we move Principal Component 1 (PC1) to the x-axis, we bring all of the data points with it, preserving their distances from PC1 and each other. Notice in the panel to the right above that the data have the same shape as they do in the panel to the left. The distances between points have not changed at all; everything has just been moved. Sample_1 and Sample_12 are just as close to each other as they were in the original scatterplot, and Sample_1 and Sample_7 are just as far away.
Principal Component 1 shows the maximum amount of variation that it is possible to show in one dimension while preserving these distances between points. What little variation remains is in PC2. Since we have maximised the amount of variation on PC1, there can be no covariation left between the x-axis and the y-axis. If any existed, then we would need to move the line again because it would mean that more variation could still be placed along the x-axis (i.e., more variation could be squeezed out of the covariation between variables). Hence, PC1 and PC2 are, by definition, uncorrelated. Note that this does not mean that Variable 1 and Variable 2 are not correlated (they obviously are!), but PC1 and PC2 are not. This is possible because PC1 and PC2 represent a mixture of Variables 1 and 2. They no longer represent a specific variable, but a linear combination of multiple variables.
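This claim is easy to check numerically. The sketch below (toy data, hypothetical variable names) runs prcomp on two strongly correlated variables and confirms that the resulting PC scores are uncorrelated.

```r
# Two correlated variables; after PCA, the scores on PC1 and PC2
# have (numerically) zero correlation.
set.seed(1)
v1  <- rnorm(n = 100)
v2  <- v1 + rnorm(n = 100, sd = 0.5)
pca <- prcomp(cbind(v1, v2))
print(cor(v1, v2))                  # strongly correlated
print(cor(pca$x[, 1], pca$x[, 2]))  # effectively zero
```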
This toy example with two variables can be extended to any number of variables and principal components. PCA finds the axis along which variation is maximised in any number of dimensions and makes that line PC1. In our example with two variables, this completed the PCA because we were only left with one other dimension, so all of the remaining variation had to go in PC2. But when there are more variables and dimensions, the data are rotated again around PC1 so that the maximum amount of variation is shown in PC2 after PC1 is fixed. For three variables, imagine a cloud of data points; first put a line through the most elongated part of the cloud and reorient the whole thing so that this most elongated part is the width of the cloud (x-axis), then spin it again along this axis so that the next widest part is the height of the cloud (y-axis). The same idea applies for data in even higher dimensional space; the data are rotated orthogonally around each additional Principal Component so that the amount of variation explained with each gets progressively smaller (in practice, however, all of this rotating is done simultaneously by the matrix algebra).
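The claim that PCA is just a rotation can also be checked directly: the rotation matrix returned by prcomp is orthogonal, so pairwise distances between points are unchanged. A sketch with toy three-variable data:

```r
# Twenty points in three dimensions; PCA rotates them without
# changing any pairwise distance.
set.seed(1)
cloud <- matrix(rnorm(60), nrow = 20, ncol = 3)
pca   <- prcomp(cloud)
# The rotation matrix is orthogonal: t(R) %*% R is the identity.
print(round(t(pca$rotation) %*% pca$rotation, 10))
# Pairwise distances are identical before and after the rotation.
print(all.equal(as.numeric(dist(cloud)), as.numeric(dist(pca$x))))
```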
There are a few additional points to make before moving on to some real data. First, PCA is not useful if the data are entirely uncorrelated. To see why, take a look at the plot of two hypothetical (albeit ridiculous) variables in the lower left panel. The correlation between the two variables is zero, and there is no way to rotate the data to increase the amount of variation shown on the x-axis. The PCA on the lower right is basically the same figure, just moved so that the centre is on the origin.
Second, if two variables are completely correlated, the maths underlying PCA does not work because a singular matrix is created (this is the matrix algebra equivalent of dividing by zero). Here is what it looks like visually if two variables are completely correlated.
The panel on the left shows two perfectly correlated variables. The panel on the right shows what the PCA would look like. Note that there is no variation on PC2. One hundred percent of the variation can be described using PC1, meaning that if we know the value of Variable_1, then we can be certain about the value of Variable_2.
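The singular-matrix problem can be seen directly in R. Below is a toy sketch: the covariance matrix of two perfectly correlated variables has a determinant of zero and cannot be inverted.

```r
# Perfectly correlated variables produce a singular covariance matrix.
v1 <- c(1, 2, 3, 4, 5)
v2 <- 2 * v1                 # v2 is completely determined by v1
S  <- cov(cbind(v1, v2))
print(det(S))                # zero: the matrix is singular
# solve(S) throws an error here, which is the matrix algebra
# equivalent of dividing by zero.
print(inherits(try(solve(S), silent = TRUE), "try-error"))
```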
I will now move on to introduce a morphological data set collected in 2010 from a fig wasp community surrounding the Sonoran Desert rock fig (Ficus petiolaris). These data were originally published in Duthie, Abbott, and Nason (2015), and they are publicly available on GitHub.
The association between fig trees and their pollinating and seed-eating wasps is a classic example of mutualism. The life-histories of figs and pollinating fig wasps are well-known and fascinating (Janzen 1979; Weiblen 2002), but less widely known are the species-rich communities of non-pollinating exploiter species of fig wasps (Borges 2015). These non-pollinating fig wasps make use of resources within figs in a variety of ways; some lay eggs into developing fig ovules without pollinating, while others are parasites of other wasps. All of these wasps develop alongside the pollinators and seeds within the enclosed inflorescence of figs, and typically emerge as adults after several weeks. Part of my doctoral work focused on the community of fig wasps that lay eggs in the figs of F. petiolaris in Baja, Mexico (Duthie, Abbott, and Nason 2015; Duthie and Nason 2016).
The community of non-pollinators associated with F. petiolaris includes five species of the genera Idarnes (3) and Heterandrium (2). Unlike pollinators, which climb into a small hole of the enclosed inflorescence and pollinate and lay eggs from inside, these non-pollinators drill their ovipositors directly into the wall of the fig. The left panel below shows a fig sliced in half (which often results in hundreds of fig wasps emerging from the inside). The right panel shows a fig on a branch.
The whole community is shown below. The pollinator is in panel ‘a’, while panels ‘b-f’ show the non-pollinators that I will focus on here (panels ‘g’ and ‘h’ show a host-parasitoid pair).
As part of a test for evidence of a dispersal-fecundity tradeoff in the fig wasps in panels ‘b-f’, I estimated the mean wing loading (body volume divided by wing surface area) of each species. This required 11 morphological measurements in total.
fig_wasps <- read.csv("data/wing_loadings.csv", header = TRUE);
fig_cols <- colnames(fig_wasps);
print(fig_cols);
## [1] "Species" "Site" "Tree"
## [4] "Fruit" "Head_Length_mm" "Head_Width_mm"
## [7] "Thorax_Length_mm" "Thorax_Width_mm" "Abdomen_Length_mm"
## [10] "Abdomen_Width_mm" "Ovipositor_Length_mm" "Ovipositor_Width_mm"
## [13] "Forewing_Area_mm.2" "Hindwing_Area_mm.2" "Forewing_Length_mm"
Below, I will run a PCA in R to look at the total variation in non-pollinating fig wasp morphology. To keep things simple, I will focus just on the two species of Heterandrium (panels ‘e, f’ in the image above).
First, I will trim down the data set in fig_wasps that I read in above to remove all species that are not in the genus Heterandrium. There are two species of Heterandrium that I will put into a single data set.
het1 <- fig_wasps[fig_wasps[,1] == "Het1",] # Species in panel 'f' above
het2 <- fig_wasps[fig_wasps[,1] == "Het2",] # Species in panel 'e' above
het <- rbind(het1, het2);
This leaves us with a data frame with 32 rows and 15 columns. In total, there are 16 measurements from samples of ‘Het1’ and 16 measurements from samples of ‘Het2’.
Because PCA requires matrix manipulations, we need the R object holding the data to be a matrix. First, we can get rid of the first four columns of data. These columns include the species names, and the site, tree, and fruit from which each fig wasp was collected.
dat <- het[,-(1:4)]; # Remove columns 1 through 4
We now need to define dat as a matrix.
dat <- as.matrix(dat);
This leaves us with the final version of the data set that we will use. It includes 11 columns of morphological measurements.
head(dat);
## Head_Length_mm Head_Width_mm Thorax_Length_mm Thorax_Width_mm
## [1,] 0.566 0.698 0.767 0.494
## [2,] 0.505 0.607 0.784 0.527
## [3,] 0.511 0.622 0.769 0.511
## [4,] 0.479 0.601 0.766 0.407
## [5,] 0.545 0.707 0.828 0.561
## [6,] 0.525 0.651 0.852 0.590
## Abdomen_Length_mm Abdomen_Width_mm Ovipositor_Length_mm
## [1,] 1.288 0.504 0.376
## [2,] 1.059 0.430 0.274
## [3,] 1.107 0.504 0.359
## [4,] 1.242 0.446 0.331
## [5,] 1.367 0.553 0.395
## [6,] 1.408 0.618 0.350
## Ovipositor_Width_mm Forewing_Area_mm.2 Hindwing_Area_mm.2
## [1,] 0.098 1.339 0.294
## [2,] 0.072 0.890 0.195
## [3,] 0.062 0.975 0.214
## [4,] 0.065 0.955 0.209
## [5,] 0.081 1.361 0.361
## [6,] 0.085 1.150 0.308
## Forewing_Length_mm
## [1,] 2.122
## [2,] 1.765
## [3,] 1.875
## [4,] 1.794
## [5,] 2.130
## [6,] 1.981
These 11 columns are our variables 1-11. Our wasp measurements thereby occupy some position in 11-dimensional space on 11 orthogonal axes, which is far too complex to visualise all at once. We can first take a look at all of the possible scatterplots using pairs.
pairs(x = dat, gap = 0, cex.labels = 0.5);
There is not much that we can infer from this, except that most of the variables appear to be highly correlated. A PCA is therefore likely to be useful. Before building the PCA, it is probably a good idea to scale all of our variables. This would be especially important if we had variables measured in different units. If we do not scale the variables to have the same mean and standard deviation, then the PCA could be affected by the units in which variables were measured, which does not make sense. If, for example, we had measured thorax length and width in \(mm\), but abdomen length and width in \(cm\), we would get thorax measurements contributing more to PC1 just because the measured values would be larger (if you do not believe me, try multiplying all values in one column of the data by 10 or 100, then re-run the PCA below). For data sets that include measurements with very different unit scales (e.g., pH versus elevation in metres above sea level), this is especially problematic. I will therefore scale the data below.
dat <- scale(dat);
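To see what scale() has done, we can check the column means and standard deviations. The sketch below uses a small toy matrix in place of the fig wasp data (which is read from file above), but the same check works on dat.

```r
# After scale(), every column has mean zero and standard deviation one.
toy <- matrix(c(1, 2, 3, 4, 10, 20, 30, 40), nrow = 4, ncol = 2)
toy <- scale(toy)
print(round(colMeans(toy), 10))  # all zero
print(apply(toy, 2, sd))         # all one
```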
Now every variable has a mean of zero and a standard deviation of one. We are finally ready to build our PCA, using only one line of code.
pca_dat <- prcomp(dat);
That is it. If we want to plot the first two principal components, we can do so with the following code.
par(mar = c(5, 5, 1, 1));
plot(x = pca_dat$x[,1], y = pca_dat$x[,2], asp = 1, cex.lab = 1.25,
cex.axis = 1.25, xlab = "PC1", ylab = "PC2");
Ignore the arguments that start with cex; these are just plotting preferences. But the argument asp = 1 is important; it ensures that one unit along the x-axis is the same distance as one unit along the y-axis. If this were not set, then the distance between any two points might be misleading. If, for example, moving 100 pixels on the x-axis corresponded to one unit on PC1, but moving 100 pixels on the y-axis corresponded to two units on PC2, then the variation in PC1 would look larger relative to PC2 than it should.
Note that we are not at all restricted to comparing PC1 and PC2. We can look at any of the 11 PCs that we want. The code below compares PC1 with PC3.
par(mar = c(5, 5, 1, 1));
plot(x = pca_dat$x[,1], y = pca_dat$x[,3], asp = 1, cex.lab = 1.25,
cex.axis = 1.25, xlab = "PC1", ylab = "PC3");
Note that the points are in the same locations on the x-axis (PC1) as before, but not the y-axis (now PC3). We have just rotated 90 degrees in our 11-dimensional space and are therefore looking at our cloud of data from a different angle.
Recall that we had two groups in these data; two species of the genus Heterandrium. Now that we have plotted the position of all wasp measurements, we can also fill in the PCA with colours representing each species. This allows us to visualise how individuals in different groups are separated in our data.
par(mar = c(5, 5, 1, 1));
plot(x = pca_dat$x[,1], y = pca_dat$x[,2], asp = 1, cex.lab = 1.25,
cex.axis = 1.25, xlab = "PC1", ylab = "PC2");
h1_rw <- which(het[,1] == "Het1"); # Recall the original data set
h2_rw <- which(het[,1] == "Het2"); # Recall the original data set
points(x = pca_dat$x[h1_rw,1], y = pca_dat$x[h1_rw,2], pch = 20, col = "red");
points(x = pca_dat$x[h2_rw,1], y = pca_dat$x[h2_rw,2], pch = 20, col = "blue");
legend("topleft", fill = c("red", "blue"), cex = 1.5,
legend = c( expression(paste(italic("Heterandrium "), 1)),
expression(paste(italic("Heterandrium "), 2))));
From the groups overlaid onto the PCA, it is clear that these two species of Heterandrium differ in their morphological measurements. Often groups are not so clearly distinguished (i.e., data are usually messier), and it is important to again remember that the PCA is not actually testing any statistical hypothesis. It is just plotting the data. If we wanted to test whether or not the difference between species measurements was statistically significant, then we would need to use some sort of appropriate multivariate test, such as a MANOVA (that is, a linear model with multiple continuous dependent variables and a single categorical independent variable), or an appropriate randomisation approach.
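As a sketch of what such a test might look like, the code below runs a MANOVA on simulated measurements for two hypothetical species (the names and values are made up, not the fig wasp data).

```r
# Hypothetical two-species data: 16 samples each, two measurements.
set.seed(1)
species <- factor(rep(c("sp1", "sp2"), each = 16))
m1 <- rnorm(n = 32, mean = ifelse(species == "sp1", 1.0, 1.2), sd = 0.1)
m2 <- rnorm(n = 32, mean = ifelse(species == "sp1", 0.5, 0.7), sd = 0.1)
fit <- manova(cbind(m1, m2) ~ species)
summary(fit, test = "Pillai")  # multivariate test of a species difference
```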
We can look at the amount of variation explained by each PC by examining pca_dat$sdev, which reports the standard deviations of the principal components.
pca_dat$sdev;
## [1] 2.5247440 1.3801063 1.0554983 0.7327957 0.5851569 0.5002464 0.4027246
## [8] 0.3451656 0.3368968 0.2424040 0.1538419
If we want to see the proportion of the total variation explained by each of the PCs, we can just calculate the variation in each PC divided by the total variation of all PCs.
pca_variance <- (pca_dat$sdev) * (pca_dat$sdev);
pca_pr_explained <- pca_variance / sum(pca_variance);
print(pca_pr_explained);
## [1] 0.579484759 0.173153952 0.101279703 0.048817227 0.031128058 0.022749679
## [7] 0.014744282 0.010830844 0.010318129 0.005341792 0.002151575
Note that all of the values above sum to 1. It appears that about 75.26 per cent of the total variation is explained by PC1 and PC2, so we are looking at a lot of the variation in the PCA shown above. It is sometimes helpful to plot the proportion of the variation explained by each PC in a bar plot.
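A sketch of such a bar plot is below. It uses a simulated ten-variable data set so that it is self-contained, but substituting the pca_pr_explained vector calculated above works in exactly the same way.

```r
# Proportion of variance explained by each PC, shown as a bar plot
# (toy data standing in for the fig wasp measurements).
set.seed(1)
toy       <- matrix(rnorm(200), nrow = 20, ncol = 10)
pca       <- prcomp(scale(toy))
pr_explnd <- pca$sdev^2 / sum(pca$sdev^2)
barplot(height = pr_explnd, names.arg = paste0("PC", 1:10),
        ylab = "Proportion of variance explained", las = 2)
```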
Those who are satisfied with their conceptual understanding of PCA, and the ability to use PCA in R, can stop here. In the next section, I will explain how PCA works mathematically.
Borges, Renee M. 2015. “How to be a fig wasp parasite on the fig-fig wasp mutualism.” Current Opinion in Insect Science 8. Elsevier Inc: 1–7. https://doi.org/10.1016/j.cois.2015.01.011.
Duthie, A Bradley, Karen C Abbott, and John D Nason. 2015. “Trade-offs and coexistence in fluctuating environments: evidence for a key dispersal-fecundity trade-off in five nonpollinating fig wasps.” American Naturalist 186 (1): 151–58. https://doi.org/10.1086/681621.
Duthie, A Bradley, and John D Nason. 2016. “Plant connectivity underlies plant-pollinator-exploiter distributions in Ficus petiolaris and associated pollinating and non-pollinating fig wasps.” Oikos, In press. https://doi.org/10.1111/oik.02629.
Janzen, Daniel H. 1979. “How to be a fig.” Annual Review of Ecology and Systematics 10 (1): 13–51. https://doi.org/10.1146/annurev.es.10.110179.000305.
Weiblen, George D. 2002. “How to be a fig wasp.” Annual Review of Entomology 47: 299–330.